ReCo: automated NGS read-counting of single and combinatorial CRISPR gRNAs

Abstract
Summary: CRISPR screens are increasingly performed to associate genotypes with phenotypes. So far, however, their analysis has required specialized computational knowledge to transform high-throughput next-generation sequencing (NGS) data into formats amenable to downstream analysis. We developed ReCo, a stand-alone and user-friendly analytics tool for generating read-count tables from NGS data of single and combinatorial CRISPR libraries and screens. Together with cutadapt and bowtie2 for rapid sequence trimming and alignment, ReCo enables the automated generation of read-count tables from staggered NGS reads for the downstream identification of gRNA-induced phenotypes.
Availability and implementation: ReCo is published under the MIT license and available at https://github.com/KaulichLab/ReCo.


Introduction
The CRISPR-Cas system has emerged as an important tool for genome editing (Jinek et al. 2012, Cong et al. 2013, Wang and Doudna 2023). In its engineered version, the system consists of two components: a Cas endonuclease and a single guide RNA (sgRNA, hereafter gRNA) that guides the Cas enzyme to a predefined locus in the genome. Depending on the type of system, the targeted locus can be perturbed in multiple ways, among them the induction of double-strand breaks (causing insertions or deletions, InDels), the editing of individual bases (base or prime editing) (Anzalone et al. 2020), or the recruitment of effector domains to repress or activate gene transcription (CRISPRi or CRISPRa) (Gilbert et al. 2014, Liu et al. 2022). In its most widely used form, a Cas nuclease, e.g. SpCas9, induces a DNA double-strand break in a coding exon, resulting in frameshift mutations that cause functional knockouts of the genes of interest.
When target sites are bundled in a gRNA library, a population of mutant cells can be generated and screened for a phenotype of interest, enabling unbiased genotype-to-phenotype associations (Shalem et al. 2014, Wang et al. 2014, Bock et al. 2022). To do so, the gRNA expression cassette is stably integrated into the host cell genome, which allows its population frequency to be used as a surrogate for the gRNA-induced phenotype (Shalem et al. 2014, Wang et al. 2014, Ford et al. 2019). gRNA frequencies are quantified by NGS, comparing different screening time points with the frequencies in the gRNA library. Because gRNA libraries have low sequence diversity (only the gRNA part of the NGS read is variable), gRNA amplicons are commonly sequenced with staggered oligos, which renders the gRNA position random within a window of up to eight nucleotides and avoids low-diversity issues during NGS runs. This, however, prevents the extraction of gRNA sequences from a fixed position in the reads and requires additional read trimming and alignment steps for data processing.
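To illustrate why staggered reads rule out fixed-position extraction, the following minimal Python sketch locates the gRNA relative to a constant upstream flank instead of a fixed read offset. The flank sequence, the gRNA length of 20 nucleotides, and the function names are assumptions made for illustration only; the sketch uses exact string matching and is not the trimming procedure implemented in ReCo or cutadapt.

```python
from collections import Counter

# Hypothetical constant sequence directly 5' of the gRNA (assumption).
UPSTREAM_FLANK = "GAAACACCG"
# Assumed gRNA length; 20 nt is typical for SpCas9 libraries.
GRNA_LENGTH = 20


def extract_grna(read, flank=UPSTREAM_FLANK, grna_length=GRNA_LENGTH):
    """Return the gRNA that follows the flank, or None if it cannot be found."""
    start = read.find(flank)
    if start == -1:
        return None  # flank absent: read cannot be assigned
    grna_start = start + len(flank)
    grna = read[grna_start:grna_start + grna_length]
    # Reject reads that end before the full gRNA has been sequenced.
    return grna if len(grna) == grna_length else None


def count_grnas(reads):
    """Count extracted gRNAs across reads, skipping reads without a usable flank."""
    counts = Counter()
    for read in reads:
        grna = extract_grna(read)
        if grna is not None:
            counts[grna] += 1
    return counts


# The same amplicon sequenced with staggers of 0, 1, 2 and 8 nucleotides.
GRNA = "ACGTTGCAAGCTTCGATCGA"
staggered_reads = [
    stagger + UPSTREAM_FLANK + GRNA + "GTTTTAGAGC"
    for stagger in ("", "T", "CA", "GTCATGCA")
]
counts = count_grnas(staggered_reads)
```

All four reads are assigned to the same gRNA although it starts at a different offset in each of them, whereas slicing every read at one fixed offset would recover the gRNA only from the reads that share that stagger length.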
Although this setup is widely used, automated pipelines that generate gRNA read-count tables from staggered NGS data are lacking, which makes it difficult for groups with limited computational expertise to analyze their CRISPR libraries and screening samples. Closing this gap, we present the Read Counting tool ReCo, which automatically generates read-count tables from single-end and paired-end NGS fastq files with minimal input requirements.
ReCo is implemented as a Python 3 package that can also be run as a standalone command-line tool. It uses the parallelization capabilities of two external tools, cutadapt and bowtie2 (Martin 2011, Langmead and Salzberg 2012), to decrease sample processing time. ReCo can process arbitrary numbers of single-end and paired-end samples per run, corresponding to single or combinatorial CRISPR gRNA libraries. The tool requires minimal information per sample: only a unique sample name and the locations of the fastq and gRNA library files. Optionally, ReCo integrates expected sequencing depths and accepts vector maps in SnapGene format to account for 3Cs-technology-based samples (Wegner et al. 2019, Diehl et al. 2021). If provided with a vector file, ReCo automatically finds the 3Cs-template sequences and reports their abundance in the final report.
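As a rough sketch of how a trimming and alignment step can be parallelized with the two external tools, the Python snippet below assembles cutadapt and bowtie2 command lines for one single-end sample. The `Sample` class, the helper names, the output file names, the flank sequence, and the particular combination of options are assumptions for illustration; the text does not state which parameters ReCo passes to either tool. The options themselves (`-j`, `-g`, `--discard-untrimmed`, `-l`, `-o` for cutadapt; `-p`, `--norc`, `--no-unal`, `-x`, `-U`, `-S` for bowtie2) are documented options of the respective programs. The commands are only constructed, not executed.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Minimal per-sample information (hypothetical container)."""
    name: str
    fastq: str
    library_index: str  # prefix of a bowtie2 index built from the gRNA library


def cutadapt_command(sample, flank, grna_length=20, cores=4):
    """Build a cutadapt call that removes the 5' flank and keeps the gRNA."""
    return [
        "cutadapt",
        "-j", str(cores),          # number of CPU cores
        "-g", flank,               # 5' adapter: removes flank and stagger before it
        "--discard-untrimmed",     # drop reads in which the flank was not found
        "-l", str(grna_length),    # shorten remaining read to the gRNA length
        "-o", f"{sample.name}.trimmed.fastq",
        sample.fastq,
    ]


def bowtie2_command(sample, threads=4):
    """Build a bowtie2 call that aligns trimmed reads to the gRNA library."""
    return [
        "bowtie2",
        "-p", str(threads),        # number of alignment threads
        "--norc",                  # assumption: align to the forward strand only
        "--no-unal",               # omit unaligned reads from the SAM output
        "-x", sample.library_index,
        "-U", f"{sample.name}.trimmed.fastq",
        "-S", f"{sample.name}.sam",
    ]


sample = Sample(name="sampleA", fastq="sampleA.fastq", library_index="library")
trim_cmd = cutadapt_command(sample, flank="GAAACACCG", cores=15)
align_cmd = bowtie2_command(sample, threads=15)
```

Each list could be passed to `subprocess.run(..., check=True)` once both tools are installed and the index has been created with `bowtie2-build`.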

Benchmarking
To assess the relative performance of ReCo, we benchmarked it against PinAPL-py (Spahn et al. 2017).
PinAPL-py was chosen because it is the only other available tool that operates on staggered NGS reads. Benchmarking was limited to single-end NGS reads, as PinAPL-py does not accept paired-end NGS reads. We used the test data set provided by PinAPL-py, which is derived from a genome-wide CRISPR-Cas knockout drop-out screen in A375 melanoma cells using the SpCas9 Brunello gRNA library and contains 67.9 million reads (Doench et al. 2016). We separated the benchmarking into two aspects: (i) the number of gRNAs found and the associated alignment rates, and (ii) the required run time. Alignment rates were similar, with 83.88% for PinAPL-py and 83.91% for ReCo, as were the numbers of gRNAs found, with 98.58% (76 341 of 77 441) for PinAPL-py and 98.62% (76 372 of 77 441) for ReCo. However, we found an issue in PinAPL-py's alignment parameters that resulted in the failure to detect 31 gRNA sequences that are the reverse complement of other gRNAs, an issue that does not occur with ReCo. To benchmark the required run time, sampled datasets corresponding to 500K, 1M, 5M, 10M, 25M, 50M, 100M, 200M, 300M, 600M, and 1.2B reads were derived from the original test data. Each dataset was processed individually by PinAPL-py and ReCo on 15 cores, with no other jobs running, to maximize parallelization and ensure a fair comparison. While the run time of PinAPL-py grew exponentially with sample size and was dominated by trimming, alignment, and mapping/counting, the run time of ReCo was determined solely by trimming, with alignment and mapping/counting being decoupled from sample size (Fig. 1b). With increasing sample size, the ratio between the required run times of PinAPL-py and ReCo increased (Fig. 1c), demonstrating that ReCo scales better with samples of high diversity, such as combinatorial libraries or multiple diverse samples.
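The 31 gRNAs missed by PinAPL-py correspond exactly to the difference between the two gRNA counts (76 372 - 76 341). Library entries that are the reverse complement of another entry can be identified before alignment, for example with the short Python sketch below. The function names and the toy library are hypothetical, and the sketch only flags such pairs; it makes no claim about how either tool handles them internally.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_pairs(grnas):
    """List unordered pairs of library gRNAs that are reverse complements."""
    library = set(grnas)
    pairs = set()
    for seq in library:
        rc = reverse_complement(seq)
        # Skip self-complementary sequences; they do not form a pair.
        if rc in library and rc != seq:
            pairs.add(tuple(sorted((seq, rc))))
    return sorted(pairs)


# Toy library (made-up sequences): the second entry is the reverse
# complement of the first, the third entry has no partner.
toy_library = [
    "AAACCCGGGTTTACGTACGA",
    reverse_complement("AAACCCGGGTTTACGTACGA"),
    "GGGGACTTACCGATTACGAT",
]
pairs = reverse_complement_pairs(toy_library)
```

An aligner that considers both strands can assign a read from one member of such a pair to the other member, which is why these entries deserve separate attention when alignment parameters are chosen.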

Conclusions
ReCo is a scalable read-counting tool for single and combinatorial CRISPR gRNA library data. It automatically recognizes gRNA positions in staggered single-end and paired-end NGS reads, generates read-count files for further data analysis, and provides a visual quality-control report summarizing the percentage of aligned and trimmed reads, the expected and obtained sequencing depth, and the gRNA and sample distribution skew. Combined with downstream CRISPR analysis tools, it allows experienced and inexperienced users alike to efficiently analyze gRNA- and gene-level phenotypes across diverse CRISPR screen conditions.

Figure 1. (b) Benchmarking of running times for ReCo and PinAPL-py. Sequencing samples of between 0.5 million and 1.2 billion reads were processed on 15 cores; the time required for each individual step was measured in seconds and is shown for each sample. While the time requirements of PinAPL-py grew for every processing step, in ReCo only the trimming procedure required more time as the number of input reads increased. (c) The ratio of PinAPL-py to ReCo running times increases with the number of processed reads, meaning that the time requirements of PinAPL-py increase faster than those of ReCo.

M. Wegner and M. Kaulich
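The quality-control quantities named in the Conclusions (percentage of assigned reads, obtained versus expected sequencing depth, and distribution skew) can be derived from a read-count table, as in the minimal Python sketch below. The definitions used here are assumptions: depth is taken as mean reads per gRNA and skew as the ratio of the 90th to the 10th percentile of gRNA counts, a common convention for CRISPR libraries. The text does not specify the formulas used in the ReCo report, and all names and numbers are illustrative.

```python
import math


def nearest_rank_percentile(values, percent):
    """Nearest-rank percentile for an integer percent between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, at least rank 1.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def qc_summary(counts, total_reads, expected_depth):
    """Summarize a gRNA read-count table (assumed metric definitions)."""
    values = list(counts.values())
    assigned = sum(values)
    low = nearest_rank_percentile(values, 10)
    high = nearest_rank_percentile(values, 90)
    return {
        "percent_assigned": 100 * assigned / total_reads,
        "obtained_depth": assigned / len(values),   # mean reads per gRNA
        "expected_depth": expected_depth,
        "skew_90_10": high / low if low > 0 else math.inf,
        "missing_grnas": sum(1 for v in values if v == 0),
    }


# Toy count table with ten gRNAs and made-up read counts.
toy_counts = {f"gRNA_{i}": 10 * i for i in range(1, 11)}
summary = qc_summary(toy_counts, total_reads=600, expected_depth=50)
```

A skew ratio close to 1 indicates an evenly represented library, while large values or many gRNAs with zero reads point to bottlenecks during library preparation or screening.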